Parallel Programming Laboratory

Charm++ Release 7.0.0

10/25/2021

PPL is happy to announce the latest stable version of Charm++, v7.0.0! This is a feature release with many new features, bug fixes, and performance improvements. You can view the release notes and download the source code or binaries here.

SC 21 Charm++ BoF

10/14/2021

PPL will be hosting a Birds of a Feather session titled "Charm++ and AMPI: Adaptive and Asynchronous Parallel Programming" at the SuperComputing 2021 Conference in St. Louis on Tuesday, November 16th, 2021. It will be held in-person in Room 227-228 at 12:15 - 1:15pm CST, with a Zoom stream available for those attending remotely.

Charm++ Release 6.10.2

08/05/2020

PPL is happy to announce the latest stable version of Charm++, v6.10.2! This is a bugfix release. You can view the release notes and download the source code or binaries here.

Charm++ Release 6.10.1

03/05/2020

PPL is happy to announce the latest stable version of Charm++, v6.10.1! This is a bugfix release. You can view the release notes and download the source code or binaries here.

Charm++ Release 6.10.0

02/12/2020

PPL is happy to announce the latest stable version of Charm++, v6.10.0! This is a feature release with many new features, bug fixes, and performance improvements. You can view the release notes and download the source code or binaries here.

Charm++ Release 6.9.0

11/12/2018

PPL is happy to announce the latest stable version of Charm++, v6.9.0! This is a feature release with many new features, bug fixes, and performance improvements. You can view the release notes and download the source code or binaries here.

Charm++/AMPI BoF at SC18

09/18/2018

PPL will be presenting a Birds of a Feather session on Charm++ and AMPI at SuperComputing 2018 in Dallas on Wednesday, November 14th from 12:15 PM to 1:15 PM. The BoF is a gathering of users and developers of Charm++, as well as those interested in hearing more about it. The agenda will be posted here when finalized, but expect to hear from users about their experiences with Charm++ and AMPI and from developers highlighting new features and possible future directions for Charm++ and AMPI.

Charm++ Release 6.8.2

10/26/2017

This is a backwards-compatible patch/bug-fix release, containing just a few changes. The primary improvements are

Fix a crash in SMP builds on the OFI network layer
Improve performance of the PAMI network layer on POWER8 systems by adjusting message-size thresholds for different protocols

The code can be found in our Git repository as tag 'v6.8.2' or in a tarball

Bilge Acun successful defense and awards

10/16/2017

In August, Bilge successfully defended her dissertation titled "Mitigating Variability in HPC Systems and Applications for Performance and Power Efficiency". At SC17, Bilge will be presenting in the Doctoral Showcase. Presentation on Tue. Nov. 14 at 1:30pm and Poster on Tue. Nov. 14 at 5:15pm.

She was also recently recognized as a Rising Star in EECS 2017 by Stanford University
“Rising Stars brings top graduate and postdoc women in EECS together for scientific discussions and informal sessions aimed at navigating the early stages of an academic career.”
and she received the Kenichi Miura Award by University of Illinois at Urbana-Champaign Department of Computer Science.
"Established in 2011 by Dr. Miura (MS ‘71, PhD ‘73), this award honors a graduate student for excellence in High Performance Computing."

Charm++ Release 6.8.1

10/13/2017

What's new in Charm++ 6.8.1

This is a backwards-compatible patch/bug-fix release. Roughly 100 bug fixes, improvements, and cleanups have been applied across the entire system. Notable changes are described below:

General System Improvements

Enable network- and node-topology-aware trees for group and chare array reductions and broadcasts
Add a message receive 'fast path' for quicker array element lookup
Feature #1434: Optimize degenerate CkLoop cases
Fix a rare race condition in Quiescence Detection that could allow it to fire prematurely (bug #1658)
- Thanks to Nikhil Jain (LLNL) and Karthik Senthil for isolating this in the Quicksilver proxy application
Fix various LB bugs
1. Fix RefineSwapLB to properly handle non-migratable objects
2. GreedyRefine: improvements for concurrent=false and HybridLB integration
3. Bug #1649: NullLB shouldnt wait for LB period
Fix Projections tracing bug #1437: CkLoop work traces to the previous entry on the PE rather than to the caller
Modify [aggregate] entry method (TRAM) support to only deliver PE-local messages inline for [inline]-annotated methods. This avoids the potential for excessively deep recursion that could overrun thread stacks.
Fix various compilation warnings

Platform Support

Improve experimental support for PAMI network layer on POWER8 Linux platforms
- Thanks to Sameer Kumar of IBM for contributing these patches
Add an experimental 'ofi' network layer to run on Intel Omni-Path hardware using libfabric
- Thanks to Yohann Burette and Mikhail Shiryaev of Intel for contributing this new network layer
The GNI network layer (used on Cray XC/XK/XE systems) now respects the ++quiet command line argument during startup

AMPI Improvements

Support for MPI_IN_PLACE in all collectives and for persistent requests
Improved Alltoall(v,w) implementations
AMPI now passes all MPICH-3.2 tests for groups, virtual topologies, and infos
Fixed Isomalloc to not leave behind mapped memory when migrating off a PE

Charm++ Release 6.8.0

08/16/2017

Over 900 commits (bugfixes + improvements + cleanups) have been applied across the entire system. Major changes are described below:

Charm++ Features

Calls to entry methods taking a single fixed-size parameter can now automatically be aggregated and routed through the TRAM library by marking them with the [aggregate] attribute.
Calls to parameter-marshalled entry methods with large array arguments can ask for asynchronous zero-copy send behavior with a 'nocopy' tag in the parameter's declaration.
The runtime system now integrates an OpenMP runtime library so that code using OpenMP parallelism will dispatch work to idle worker threads within the Charm++ process.
Applications can ask the runtime system to perform automatic high-level end-of-run performance analysis by linking with the '-tracemode perfReport' option.
Added a new dynamic remapping/load-balancing strategy, GreedyRefineLB, that offers high result quality and well bounded execution time.
Improved and expanded topology-aware spanning tree generation strategies, including support for runs on a torus with holes, such as Blue Waters and other Cray XE/XK systems.
Charm++ programs can now define their own main() function, rather than using a generated implementation from a mainmodule/mainchare combination. This extends the existing Charm++/MPI interoperation feature.
Improvements to Sections:

Array sections API has been simplified, with array sections being automatically delegated to CkMulticastMgr (the most efficient implementation in Charm++). Changes are reflected in Chapter 14 of the manual.
Group sections can now be delegated to CkMulticastMgr (improved performance compared to default implementation). Note that they have to be manually delegated. Documentation is in Chapter 14 of Charm++ manual.
Group section reductions are now supported for delegated sections via CkMulticastMgr.
Improved performance of section creation in CkMulticastMgr.
CkMulticastMgr uses the improved spanning tree strategies. See above.

GPU manager now creates one instance per OS process and scales the pre-allocated memory pool size according to the GPU memory size and number of GPU manager instances on a physical node.
Several GPU Manager API changes including:

Replaced references to global variables in the GPU manager API with calls to functions.
The user is no longer required to specify a bufferID in dataInfo struct.
Replaced calls to kernelSelect with direct invocation of functions passed via the work request object (allows CUDA to be built with all programs).

Added support for malleable jobs that can dynamically shrink and expand the set of compute nodes hosting Charm++ processes.
Greatly expanded and improved reduction operations:

Added built-in reductions for all logical and bitwise operations on integer and boolean input.
Reductions over groups and chare arrays that apply commutative, associative operations (e.g. MIN, MAX, SUM, AND, OR, XOR) are now processed in a streaming fashion. This reduces the memory footprint of reductions. User-defined reductions can opt into this mode as well.
Added a new 'Tuple' reducer that allows combining multiple reductions of different input data and operations from a common set of source objects to a single target callback.
Added a new 'Summary Statistics' reducer that provides count, mean, and standard deviation using a numerically-stable streaming algorithm.

Added a '++quiet' option to suppress charmrun and charm++ non-error messages at startup.
Calls to chare array element entry methods with the [inline] tag now avoid copying their arguments when the called method takes its parameters by const&, offering a substantial reduction in overhead in those cases.
Synchronous entry methods that block until completion (marked with the [sync] attribute) can now return any type that defines a PUP method, rather than only message types.

AMPI Features

More efficient implementations of message matching infrastructure, multiple completion routines, and all varieties of reductions and gathers.
Support for user-defined non-commutative reductions, MPI_BOTTOM, cancelling receive requests, MPI_THREAD_FUNNELED, PSCW synchronization for RMA, and more.
Fixes to AMPI's extensions for load balancing and to Isomalloc on SMP builds.
More robust derived datatype support, optimizations for truly contiguous types.
ROMIO is now built on AMPI and linked in by ampicc by default.
A version of HDF5 v1.10.1 that builds and runs on AMPI with virtualization is now available at https://charm.cs.illinois.edu/gerrit/#/admin/projects/hdf5-ampi
Improved support for performance analysis and visualization with Projections.

Platforms and Portability

The runtime system code now requires compiler support for C++11 R-value references and move constructors. This is not expected to be incompatible with any currently supported compilers.
The next feature release (anticipated to be 6.9.0 or 7.0) will require full C++11 support from the compiler and standard library.
Added support for IBM POWER8 systems with the PAMI communication API, such as development/test platforms for the upcoming Sierra and Summit supercomputers at LLNL and ORNL. Contributed by Sameer Kumar of IBM.
Mac OS (darwin) builds now default to the modern libc++ standard library instead of the older libstdc++.
Blue Gene/Q build targets have been added for the 'bgclang' compiler.
Charm++ can now be built on Cray's CCE 8.5.4+.
Charm++ will now build without custom configuration on Arch Linux
Charmrun can automatically detect rank and node count from Slurm/srun environment variables.

The complete list of issues that have been merged/resolved in 6.8.0 can be found here. The associated git commits can be viewed here.

Charm++ Tutorial at HPCCamp 2017 - ECAR 2017

06/21/2017

High Performance Computing Camp - Escuela de Computación de Alto Rendimiento

Techniques and methodology for parallel programming - Module 4: Programming with parallel objects

Rescheduled and is now Sep 18-29, 2017, Buenos Aires, Argentina

Detailed Program:

Day 1: Parallel Objects Programming Fundamentals Introduction to basic concepts: overdecomposition, asynchrony, migratability and adaptivity. The parallel objects model and its advantages over traditional methods. Introduction to Charm++ programming language. Charm++ programming and execution model. Installation of Charm++ and associated libraries. Basic Charm++ code samples. Use and properties of chare arrays.

Day 2: Performance Analysis and Load Balancing Introduction to Projections, a performance analysis tool. Visualizing executions and analysis of experimental results. Performance bottleneck detection. Introduction to load balancing. Object migration and PUP methods. Load balancing strategies in Charm++. Use of different load balancing strategies for particular problems.

Day 3: Advanced Programming with Charm++ Advanced programming mechanisms in Charm++. Multidimensional array usage and chare groups. Introduction to checkpointing and its applications.

Day 4: High Level Programming with Charm++ Introduction to Structured Dagger (SDAG), a tool for high-level programming in Charm++. Survey of other high-level languages in the Charm++ ecosystem. Presentation of real applications using Charm++.

Beta release Charm++ 6.8.0.

04/06/2017

Hello everyone!

We're pleased to announce a beta release of Charm++ in advance of the upcoming version 6.8.0. We ask that users take this opportunity to test the latest code with their applications and report any issues encountered.

The code for this release can be obtained by

git clone https://charm.cs.illinois.edu/gerrit/charm.git

git checkout v6.8.0-beta2

(Beta 1 was not announced due to bugs found in internal testing)

We have also posted corresponding updated Java binaries of Projections and CharmDebug

Among over 700 commits made since the release of version 6.7.1, some of the larger and more exciting improvements in the system include:

Calls to entry methods taking a single fixed-size parameter can now automatically be aggregated and routed through the TRAM library by marking them with the [aggregate] attribute.
Calls to parameter-marshalled entry methods with large array arguments can ask for asynchronous zero-copy send behavior with an 'rdma' tag in the parameter's declaration.
The runtime system now integrates an OpenMP runtime library so that code using OpenMP parallelism will dispatch work to idle worker threads within the Charm++ process.
Applications can ask the runtime system to perform automatic high-level end-of-run performance analysis by linking with the ' -tracemode perfReport' option.
Added a new dynamic remapping/load-balancing strategy, GreedyRefineLB, that offers high result quality and well bounded execution time.
Charm++ programs can now define their own main() function, rather than using a generated implementation from a mainmodule/mainchare combination. This extends the existing Charm++/MPI interoperation feature.
GPU manager now creates one instance per OS process and scales the pre-allocated memory pool size according to the GPU memory size and number of GPU manager instances on a physical node.
Several GPU Manager API changes including:

Replaced references to global variables in the GPU manager API with calls to functions.
The user is no longer required to specify a bufferID in dataInfo struct.
Replaced calls to kernelSelect with direct invocation of functions passed via the work request object (allows CUDA to be built with all programs).

Added support for malleable jobs that can dynamically shrink and expand the set of compute nodes hosting Charm++ processes.
Greatly expanded and improved reduction operations:

Added built-in reductions for all logical and bitwise operations on integer and boolean input.
Reductions over groups and chare arrays that apply commutative, associative operations (e.g. MIN, MAX, SUM, AND, OR, XOR) are now processed in a streaming fashion. This reduces the memory footprint of reductions. User-defined reductions can opt into this mode as well.
Added a new 'Tuple' reducer that allows combining multiple reductions of different input data and operations from a common set of source objects to a single target callback.
Added a new 'Summary Statistics' reducer that provides count, mean, and standard deviation using a numerically-stable streaming algorithm.

Added a '++quiet' option to suppress charmrun and charm++ non-error messages at startup.
Calls to chare array element entry methods with the [inline] tag now avoid copying their arguments when the called method takes its parameters by const&, offering a substantial reduction in overhead in those cases.
Synchronous entry methods that block until completion (marked with the [sync] attribute) can now return any type that defines a PUP method, rather than only message types.
Improved and expanded topology-aware spanning tree generation strategies, including support for runs on a torus with holes, such as Blue Waters and other Cray XE/XK systems.

Future portability/compatibility note:

Please be aware that all feature releases of the Charm++ system following the final 6.8 will require full C++11 support from the compiler and standard library in use.

NAMD 1 of 10 selected for the Aurora Early Science Program

02/16/2017

Aurora Early Science Program

NAMD is 1 of 10 computational science and engineering research projects that were selected for the ALCF Aurora Early Science Program. Aurora is expected to arrive in 2018 and will be a massively parallel, manycore Intel-Cray supercomputer. For more information about this program, click here.

The project "Free energy landscapes of membrane transport proteins" will be using NAMD and is lead by Benoit Roux, The University of Chicago, in collaboration with NIH Center for Macromolecular Modeling and Bioinformatics, Beckman Institute, The University of Illinois.

Xiang Ni defended her dissertation

02/14/2017

Last summer, Xiang Ni successfully defended her thesis titled “Mitigation of Failures in High Performance Computing via Runtime Techniques”. Her thesis leverages runtime system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of various system-level fault tolerance strategies designed in this thesis are: reducing the extra cost added to application execution while improving system reliability; automatically adjusting fault tolerance decisions without user intervention based on environmental changes; protecting applications not only from fail-stop failures but also from silent data corruptions.

Harshitha defended her dissertation

12/06/2016

Harshitha Menon successfully defended her thesis titled ‘Adaptive Load Balancing for HPC Applications’ over the summer. Her thesis addresses load imbalance problem in HPC applications. The thesis presents various load balancing algorithms and the use of adaptive runtime techniques along with machine learning to automate the load balancing decisions. Abstract.

PPL and Charm++ at SC16

11/30/2016

Click to see our webpage for details

Phil Miller defended his dissertation

11/29/2016

Over the summer, Philip Miller successfully defended his dissertation titled "Reducing Synchronization in Distributed Parallel Programs". Abstract

Charm++ power, resilience work featured in IEEE computer Oct ’16 issue!

10/20/2016

Over the years PPL has carried out research on multiple aspects of optimizing power, energy, temperature without sacrificing performance. Automatic runtime adaptation through the Charm++ runtime system has been a key foundation to all the approaches explored. Resilience related solutions are also enabled by the same runtime approach. A broad summary of our research and its connection with adaptive runtime was published recently in IEEE Computer.

Link to the web article: Power, Reliability, Performance: One System to Rule Them All [IEEE Computer October 2016]

Charm++ Release 6.7.1

04/20/2016

Changes in this release are primarily bug fixes for 6.7.0. The major exception is AMPI, which has seen changes to its extension APIs and now complies with more of the MPI standard. A brief list of changes follows:

Charm++ Bug Fixes

Startup and exit sequences are more robust
Error and warning messages are generally more informative
CkMulticast’s set and concat reducers work correctly

AMPI Features

AMPI’s extensions have been renamed to use the prefix AMPI_ instead of MPI_ and to generally follow MPI’s naming conventions
AMPI_Migrate(MPI_Info) is now used for dynamic load balancing and all fault tolerance schemes (see the AMPI manual)
AMPI officially supports MPI-2.2, and also implements the non-blocking collectives and neighborhood collectives from MPI-3.1

Platforms and Portability

Cray regularpages build target has been fixed
Clang compiler target for BlueGene/Q systems added
Comm. thread tracing for SMP mode added
AMPI’s compiler wrappers are easier to use with autoconf and cmake

Jonathan Lifflander defends his dissertation

03/15/2016

Jonathan Lifflander successfully defended his dissertation entitled "Optimizing Work Stealing Algorithms with Scheduling Constraints". His thesis examines methodologies to improve the efficiency of fork--join programming models in conjunction with work stealing schedulers by exploiting persistency in iterative scientific benchmarks. His thesis demonstrates a highly scalable implementation of distributed-memory work stealing using a novel tracing framework to record task execution locations in the presence of random steals, while incurring very low overheads. This same tracing framework is used to optimize work stealing on NUMA architectures. Finally, by introducing data effect annotations to fork--join models in conjunction with runtime tracing, his work enables fork--join schedulers to execute ahead of syncs to accrue cache locality benefits.

Nikhil defends his dissertation

02/01/2016

PPLer Nikhil Jain has successfully defended his dissertation titled "Optimization of Communication Intensive Applications on HPC Networks". In a hour long public presentation given to his thesis committee, which consists of Illinois Professors Kale, Gropp, Torrellas, and OSU Prof. Panda, Nikhil described the importance of communication in HPC applications and presented his two step approach for optimizing it on HPC networks. Use of machine learning to perform diagnostic studies that can help identify important metrics forms the first step. The second step is to use parallel discrete event simulation tools developed based on learning from the first step for mimicking communication flow on HPC networks. The thesis presents a few example use cases of these tools by comparing HPC networks with different topologies and by predicting the impact of changes in network parameters. In addition to this methodology, the thesis also contains work on topology aware mapping, job placement, and communication algorithms. More details on Nikhil’s research and his thesis can be found at his personal home page.

Charm++ and AMPI BoF at SC15

11/19/2015

Link to info

PPL at SC15

11/15/2015

See our webpage for details

Charm++ Tutorial at SBAC-PAD 2015

09/15/2015

Celso, Laércio and Esteban will present the Charm++ tutorial at the 27th annual SBAC-PAD on October 21st in Santa Catarina, Brazil. Link to info

Akhil Langer receives Kenichi Miura Award 2015

04/30/2015

PPLer Akhil Langer has received the 2015 Kenichi Miura Award. This award honors a graduate student for outstanding accomplishments in High Performance Computing. Akhil works with Prof Laxmikant Kale and Prof Udatta Palekar on several aspects of high performance computing including power and energy optimizations, stochastic optimization, load balancing, adaptive mesh refinement. Akhil's thesis work provides a computational-engine for many real-time and dynamic problems faced by US Air Mobility Command. It is expected that this work will provide the springboard for more robust problem solving with HPC in many logistics and planning problems.

Charm++ tutorial in Brazil

02/19/2015

Laércio Lima Pilla is leading a Charm++ tutorial as part of a regional gathering on HPC: http://erad2015.inf.ufrgs.br/minicursos.html, April 22-24, 2015. Laércio is a former student of Prof. Navaux (Federal University of Rio Grande do Sul) and Prof. Mehaut (University of Grenoble). He is now an associate professor at the University of Santa Catarina in Brazil.

The Coding Illini team with PPLer Phil Miller win 2014 PUCC

12/01/2014

Link to info

PPL @ SC'14

11/16/2014

Link to info

Lifflander et al. Win Best Student Paper at CLUSTER'14

09/23/2014

Jonathan Lifflander, Esteban Meneses, Harshitha Menon, Phil Miller, Sriram Krishnamoorthy, and Laxmikant V. Kale have won the best student paper award at CLUSTER'14 in Madrid, Spain!

This was awarded for their fault-tolerance paper that describes a new theoretical model for dependencies that reduces the amount of data required to perform deterministic replay. Using the algorithm presented, we demonstrate 2x better performance and scalability up to 128k cores of BG/P `Intrepid'. The paper is entitled: Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance.

Live Webcast 15th Annual Charm++ Workshop

What's new in Charm++ 6.8.1

High Performance Computing Camp - Escuela de Computación de Alto Rendimiento

Detailed Program:

Hello everyone!

The code for this release can be obtained by

Future portability/compatibility note:

Aurora Early Science Program

Charm++ Bug Fixes

AMPI Features

Platforms and Portability